What is data science, and where did it come from? Is data science a new and exciting set of skills, necessary for analyzing 21st century data? Or is it a rebranding of statistics, which has carefully developed time-honored methods for data analysis over the past century?
Priority disputes – disagreements over who deserves credit for a new scientific theory or method – date back to the beginning of science. Famous examples include the invention of calculus and ordinary least squares. But this latest dispute calls into question the novelty of an entire discipline.
In this article, we use two popular data science algorithms to examine the differences among data science, statistics, and other occupations. We find that in terms of the preparation required to become a data scientist, data science reflects both the work of natural sciences managers – individuals who oversee research operations in the natural sciences – and statisticians and mathematicians. This suggests that data science is a shared enterprise among science and math, and thus those trained in the natural sciences have as much claim to data science as those trained in mathematics and statistics.
In terms of the role a data scientist serves relative to other occupations, however, we find that data science is essentially statistics. Both occupations are fast growing and central among the occupations that work with data. But while statistics and data science may serve nearly identical roles today, the centrality of statistics has declined over the past decade relative to other occupations. In contrast, the centrality of data science has grown, now surpassing statistics as the most central fast-growing occupation.
We examine the role of data science using data science
Everyone seems to agree that data science requires skills traditionally associated with a variety of different occupations. Drew Conway, for example, describes data science as a combination of math and statistics, substantive (domain) expertise, and “hacking” skills (see Figure 1). In dispute is the relative importance of those skills. Some have argued that data science is basically statistics – and that twentieth century statisticians like John Tukey have long possessed the data science skills traditionally associated with computer science and the natural sciences. Others have argued that data science is truly interdisciplinary, and statistical thinking only plays a small role. But while opinions on data science abound, few appear to be based on data or science.1
To that end, we use two popular data science algorithms, naïve Bayes and eigen centrality (eigen decomposition), to investigate the question: What is data science? Both algorithms use data listing the training a worker must generally complete to work in an occupation, such as data science. Specifically, we use the “CIP SOC Crosswalk” provided by the US Bureau of Labor Statistics and US National Center for Education Statistics, which links the Classification of Instructional Programs – the standard classification of educational fields of study into roughly 2,000 instructional programs – with the Standard Occupational Classification – the standard classification of professions into roughly 700 occupations.
Our main assumption is that the skills required to work in an occupation can be represented by the instructional programs that prepare students to work in that occupation. For example, the occupation “data scientists” is associated with 35 instructional programs, such as data science, statistics, artificial intelligence, computational science, mathematical biology, and econometrics. The occupation “statisticians” is associated with 26 instructional programs, including data science, statistics, and econometrics, but not artificial intelligence, computational science, or mathematical biology.
The algorithms we employ consider occupations to be similar if they have many instructional programs in common. Data scientists and statisticians share 14 instructional programs, suggesting they are similar: Roughly half of the 26 programs that prepare students to be a statistician also prepare students to be a data scientist. In contrast, data scientists and computer programmers share only six instructional programs, suggesting they are less similar; computer programmers have 17 programs overall, so only about a third of the programs that prepare students to be a computer programmer also prepare students to be a data scientist.2
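The overlap comparison described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `shared_programs` is hypothetical, and the integer IDs are placeholders rather than real CIP codes. Only the set sizes (35, 26, and 17 programs, with 14 and 6 shared) are taken from the text.

```python
def shared_programs(target, other):
    """Return (number of shared programs, share of `other`'s programs that are shared)."""
    common = target & other
    return len(common), len(common) / len(other)


# Placeholder integer IDs stand in for instructional programs (NOT real CIP codes).
# Only the set sizes and overlaps follow the counts quoted in the text.
data_scientists = set(range(35))                          # 35 programs
statisticians = set(range(14)) | set(range(100, 112))     # 26 programs, 14 shared
programmers = set(range(29, 35)) | set(range(200, 211))   # 17 programs, 6 shared

for name, programs in [("statisticians", statisticians),
                       ("computer programmers", programmers)]:
    count, fraction = shared_programs(data_scientists, programs)
    print(f"{name}: {count} shared programs, {fraction:.0%} of their programs")
```

The fraction is computed relative to the second occupation's programs, which matches how the comparison is phrased in the text (the share of a statistician's or programmer's programs that also lead to data science).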
Data science is closest to statistics in its role among other occupations
We use eigen centrality (eigen decomposition) to measure the similarity of each occupation in terms of its role relative to other occupations. Specifically, we calculate the principal right singular vector of the adjacency matrix denoting whether an instructional program (row) is associated with an occupation (column).4 An occupation has high eigen centrality when the instructional programs that prepare a worker for that occupation also prepare that worker for many other occupations. This suggests that the higher the measure, the more central the role of the occupation relative to other occupations.
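As a rough illustration of the calculation just described, the sketch below computes the centrality of four hypothetical occupations from a small made-up adjacency matrix using NumPy. The matrix `A` and its dimensions are assumptions for demonstration only, not the CIP SOC Crosswalk data. The sketch also checks the equivalent formulation: the principal eigenvector of the matrix counting the programs each pair of occupations has in common.

```python
import numpy as np

# Toy adjacency matrix (made up for illustration, not the crosswalk data):
# rows are instructional programs, columns are occupations;
# A[i, j] = 1 if program i prepares students for occupation j.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# Singular values come back in descending order, so the first row of Vt
# is the principal right singular vector (one entry per occupation).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The sign of a singular vector is arbitrary. For a non-negative matrix whose
# occupations are all connected, the principal vector has entries of a single
# sign, so taking absolute values yields non-negative centrality scores.
centrality = np.abs(Vt[0])

# Equivalent route: A.T @ A counts the programs each pair of occupations
# shares; eigh returns eigenvalues in ascending order, so the last column
# is the principal eigenvector.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
principal = np.abs(eigvecs[:, -1])

print("centrality scores:", np.round(centrality, 3))
print("most central occupation (column index):", int(np.argmax(centrality)))
```

In this toy example the first occupation scores highest because its programs also lead to the other well-connected occupations, while the last occupation, which shares only one program with the rest, scores lowest.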
The eigen centrality of each occupation is displayed in Figure 3. Each point represents an occupation, the x-axis denotes the centrality of the occupation, and the y-axis denotes the percent growth of the occupation as predicted by the US Bureau of Labor Statistics over the next decade. The figure demonstrates that data scientists and statisticians occupy nearly identical positions: Both are fast growing and central to the other occupations that work with data. In contrast, natural sciences managers are central but growing much more slowly, suggesting a role closer to managers. We conclude that – though data scientists are prepared similarly to natural sciences managers – the role of the data scientist is essentially the role of the statistician.
But this role may be changing. Figure 4 shows the centrality (x-axis) of each occupation (y-axis) in 2010 and 2020. Green bars denote increases from 2010 to 2020 while yellow bars denote decreases. We find that while statisticians and data scientists may serve nearly identical roles today, the centrality of statisticians has declined over the past decade relative to other occupations. In contrast, the centrality of data scientists has grown, now surpassing statisticians as the most central fast-growing occupation. We conclude that though a data scientist and a statistician serve similar roles today, those roles may change as the workforce changes. Note that occupation definitions changed in 2018, and we used the crosswalk provided by the US Bureau of Labor Statistics to make these comparisons.
The findings in this section are based on the adjacency matrix that encodes whether an instructional program (row) is associated with an occupation (column). A more detailed summary of the matrix is provided in Figure 5, which depicts the matrix as a network graph. Larger nodes represent occupations that are fast growing, while nodes closer to the center of the network represent more central occupations. The figure is interactive. You can zoom in to see the similar positions between data science and statistics, which are both large (fast growing) and central.
Figure 5: A visualization of the network of occupations in which occupations are connected by the instructional programs that train students to work in multiple fields. We find that in terms of the role a data scientist serves relative to other occupations, data science and statistics occupy nearly identical positions. Occupations are colored according to the degrees leading to them: ● Agricultural/Animal/Plant/Veterinary Science And Related Fields (CIP 01); ● Education (CIP 13); ● Biological And Biomedical Sciences (CIP 26); ● Health Professions And Related Programs (CIP 51); ● Business, Management, Marketing, And Related Support Services (CIP 52); ● Health Professions Residency/Fellowship Programs (CIP 60); ● Medical Residency/Fellowship Programs (CIP 61); ● All other occupations.
Is data science statistics?
We conclude that individuals trained in managing natural sciences research – a slow-growing occupation – are turning to data science – a much faster-growing occupation, one which currently serves a role like that of a statistician. But if present trends continue, data science is poised to eclipse the historic role of the statistician as central to the occupations that work with data.
This suggests that while data science may be new and exciting, the role served by the data scientist is not particularly new. This does not mean that data scientists necessarily use the same time-honored methods for data analysis as statisticians. It is the authors’ experience, however, that many data science tools are in fact statistical. Indeed, the two data science algorithms we used in this article are both taught to students as new and exciting but are, in reality, centuries-old methods steeped in statistical history.
Regardless of whether data science is or is not statistics, the occupation data scientist has proven immensely popular, capturing a zeitgeist that has eluded statistics. This is best evidenced by the fact that data science – and not statistics – has been crowned the sexiest job of the twenty-first century. But if statistics has not enjoyed the popularity of data science, perhaps the real question in need of answering is: What is statistics?
- About the authors
- Jonathan Auerbach is an assistant professor in the Department of Statistics at George Mason University. His research covers a wide range of topics at the intersection of statistics and public policy. His interests include the analysis of longitudinal data, particularly for data science and causal inference, as well as urban analytics, open data, and the collection, evaluation, and communication of official statistics.
- David Kepplinger is an assistant professor in the Department of Statistics at George Mason University. His research revolves around methods for robust and reliable estimation and inference in the presence of aberrant contamination in high-dimensional, complex data. He has active collaborations with researchers from the medical, biological, and life sciences.
- Nicholas Rios is an assistant professor of statistics at George Mason University. He earned his PhD in statistics in 2022 from Penn State University, where his dissertation focused on designing optimal mixture experiments. His primary research interests are experimental design and methods for intelligent data collection in the presence of real-world constraints. He is also interested in functional data analysis, computational statistics, compositional data analysis, and the analysis of high-dimensional data.
- Copyright and licence
- © 2023 Jonathan Auerbach, David Kepplinger, and Nicholas Rios
This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.
- How to cite
- Auerbach, Jonathan, David Kepplinger, and Nicholas Rios. 2023. “What is data science? A closer look at science’s latest priority dispute.” Real World Data Science, Month Day, Year. URL
Footnotes
1. Descriptions of occupations by government agencies are not particularly helpful in differentiating between data science, statistics, and related occupations. For example, according to the Bureau of Labor Statistics, data scientists use “analytical tools and techniques to extract meaningful insights from data.” This description is similar to mathematicians/statisticians, who “analyze data and apply computational techniques to solve problems,” and operations research analysts who use “mathematics and logic to help solve complex issues.”
2. Our analysis treats all instructional programs as equal and independent. We do not consider, for example, the number of workers that hold a degree from an instructional program or whether two instructional programs are similar or offered by similar academic departments. Our analysis could be adjusted to account for this or related information, although it is unclear to the authors whether such an adjustment would make the results more accurate.
3. Note that natural sciences managers share 18 instructional programs with data scientists, while statisticians share 14.
4. Or alternatively, the principal eigenvector of the adjacency matrix denoting the number of instructional programs each occupation (row) has in common with each other occupation (column).